##Question 1:
Overall, PCA does a better job than k-means clustering of distinguishing
between the red and white wines. A likely reason is that PCA compresses
the features, whereas k-means compresses the data points themselves. In
this case that works better, because we are focusing on the differences
between the features.
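As a rough illustration of that contrast, the sketch below runs both methods on synthetic data. The feature names, group means and sample sizes are assumptions made up for the example, not the wine dataset itself, and the color label is used only to check the result afterwards.

```r
# Synthetic stand-in for the wine data (assumption: made-up features and
# values; the color label is held out and not used for fitting).
set.seed(1)
n <- 200
color <- rep(c("red", "white"), each = n)
wine <- data.frame(
  sulfur  = c(rnorm(n, 45, 15),     rnorm(n, 140, 40)),
  acidity = c(rnorm(n, 0.55, 0.15), rnorm(n, 0.28, 0.10)),
  sugar   = c(rnorm(n, 2.5, 1),     rnorm(n, 6, 4))
)
X <- scale(wine)  # center and scale each feature

# PCA: compress the features into a few components
pca <- prcomp(X)
scores <- pca$x[, 1:2]

# k-means: compress the observations into 2 cluster centers
km <- kmeans(X, centers = 2, nstart = 25)

# Cross-tabulate clusters against the held-out color label
print(table(cluster = km$cluster, color = color))

# Mean PC1 score by color, to see how far apart the two groups sit
pc1_means <- tapply(scores[, 1], color, mean)
print(pc1_means)
```

On real data the two views can disagree, which is the point of comparing them against the known label.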
We then used PCA to check whether it could distinguish between wines of different quality. Based on the graph above, PCA does not separate the quality levels well.
##Question 2:
“NutrientH20” (pseudonym) wants to understand its social-media audience a little bit better, so that it could hone its messaging a little more sharply.
For the sake of this analysis (based on the pseudonym) we will consider NutrientH20 as a nutrient water brand which is entering the market of flavoured electrolytes.
The dataset includes 36 tweet categories for 7882 users and each cell represents how many times each user has posted a tweet that can be tagged to that category. The categories include…
| Category |
|---|
| adult |
| art |
| automotive |
| beauty |
| business |
| chatter |
| college_uni |
| computers |
| cooking |
| crafts |
| current_events |
| dating |
| eco |
| family |
| fashion |
| food |
| health_nutrition |
| home_and_garden |
| music |
| news |
| online_gaming |
| outdoors |
| parenting |
| personal_fitness |
| photo_sharing |
| politics |
| religion |
| school |
| shopping |
| small_business |
| spam |
| sports_fandom |
| sports_playing |
| travel |
| tv_film |
| uncategorized |
Each column holds how often a user's tweets fall into a given category, so I calculate term frequencies: the percentage of each user's tweets tagged to each category. This normalizes for the difference in number of tweets per user.
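A minimal sketch of this normalization, using a made-up three-user table rather than the real data, might look like the following. The name `counts` is a placeholder.

```r
# Toy counts: rows are users, columns are tweet categories
# (values are made up for illustration).
counts <- data.frame(
  chatter          = c(4, 0, 10),
  health_nutrition = c(6, 2, 0),
  cooking          = c(0, 8, 10)
)

# Divide each row by its total so that every user's shares sum to 1
tf <- counts / rowSums(counts)

print(round(tf, 2))
```

A user with no tweets at all would produce a division by zero here, so such rows would need to be dropped or handled first.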
We look at the 4 unwanted categories (chatter, uncategorized, adult and spam) and check what percentage of the data is filtered out when we apply a range of cutoffs to each user's term frequency in that category.
[Plots: percentage of data filtered at each term-frequency cutoff, one panel each for Chatter, Adult, Spam and No Category]
Here are the cutoffs representing the outliers of our base data:
We then checked whether these rows are mutually exclusive, to account for the total loss of data. Removing the rows flagged on these features costs 12-13% of the data, which we deemed a practical trade-off for removing a lot of noise, mainly from these 4 columns.
Chatter and no-category tweets will not help with clustering; any correlation they show with another field is assumed to be coincidental.
Spam and adult are categories we do not want to drive the clustering.
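The filtering step could be sketched roughly as below. The cutoff values and the toy table are hypothetical, chosen only to make the example run; the report's actual thresholds are not reproduced here.

```r
# Toy term-frequency table (made-up values), one row per user
tf <- data.frame(
  chatter       = c(0.10, 0.60, 0.05, 0.20),
  uncategorized = c(0.02, 0.01, 0.30, 0.03),
  adult         = c(0.00, 0.00, 0.01, 0.25),
  spam          = c(0.00, 0.00, 0.00, 0.00),
  cooking       = c(0.88, 0.39, 0.64, 0.52)
)

# Hypothetical cutoffs (assumption: not the values used in the report)
cutoffs <- c(chatter = 0.50, uncategorized = 0.20, adult = 0.10, spam = 0.05)

# Flag a user if any unwanted category exceeds its cutoff
over <- sapply(names(cutoffs), function(v) tf[[v]] > cutoffs[[v]])
flagged <- rowSums(over) > 0

# Share of users lost to the filter
print(mean(flagged))

# Keep unflagged users and drop the four unwanted columns
clean <- tf[!flagged, setdiff(names(tf), names(cutoffs)), drop = FALSE]
print(clean)
```

Flagging on "any category over its cutoff" counts each user once, even if several categories are over the line, which is what the mutual-exclusivity check is guarding against.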
We z-score the dataset and fit k-means over a grid of cluster counts to see where the elbow appears in the curve.
Given that there is no clear elbow, we set our range of k in [3,6] for clustering based on our intuition and testing. Clustering will help pull individual customers into separate groups based on similarities in tweeting patterns.
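For reference, an elbow search of this kind can be sketched as follows. The data are random placeholders and the grid 2 to 10 is an assumption, so the curve here is not expected to match the report's.

```r
set.seed(42)
# Synthetic stand-in for the user-by-category term frequencies
X <- matrix(runif(300 * 6), nrow = 300,
            dimnames = list(NULL, paste0("cat", 1:6)))

# z-score each column: subtract the mean, divide by the standard deviation
Z <- scale(X)

# Fit k-means for each k in the grid and record total within-cluster SS
k_grid <- 2:10
wss <- sapply(k_grid, function(k) {
  kmeans(Z, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_grid, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

Using several random starts (`nstart`) keeps a single unlucky initialization from distorting the curve.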
Principal Component Analysis helps understand the composition of each point as an aggregation of the different numbers and types of tweets. I consider only the first two principal components.
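A minimal sketch of extracting the first two components with base R's `prcomp()` is shown below. The input is random placeholder data, and the variable names are assumptions for the example.

```r
set.seed(7)
# Synthetic stand-in for z-scored term frequencies (made-up data)
Z <- scale(matrix(rnorm(200 * 5), nrow = 200,
                  dimnames = list(NULL, paste0("cat", 1:5))))

pca <- prcomp(Z)  # Z is already centered and scaled

# Scores: each user's coordinates on the first two components
scores <- pca$x[, 1:2]

# Loadings: how each category contributes to PC1 and PC2
loadings <- pca$rotation[, 1:2]
print(round(loadings, 2))

# Share of total variance captured by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(sum(var_share[1:2]), 2))
```

The variance share is worth checking before relying on two components alone, since a low value means much of the structure lies in the components that were dropped.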
Comparing the results of k-means and PCA.
Different plots from k-means:
Looking at how PC1 and PC2 are formed in terms of categories:
We can see 5 clusters have formed…
Comparing plots for both categories along PC1 and PC2, we can identify the segments
personal_fitness, health_nutrition and outdoors appear close together, between PC1=[-0.2,-0.1] and PC2=[-0.45,-0.3].
The 2 clusters above, college_uni and online_gaming, interact with other categories that gamers are likely to tweet about.
politics and travel land close to news, which lets us identify this cluster as people who travel and keep up with current events.
parenting, religion and sports_fandom … all show up along the right of PC1 after food, school and family.
##Question 3:
itemFrequencyPlot(groceries, topN=25, type='absolute')
plot(grocrules)
plot(head(sort(grocrules, by="support"), 20), method="graph")
plot(head(sort(grocrules, by="lift"), 20), method="graph")
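The two graphs above rank rules by support and by lift. As a reminder of what those measures mean, the sketch below computes them by hand for a single rule on a made-up set of baskets. It is plain base R for illustration and is not how `arules` computes them internally.

```r
# Toy baskets (made up); each element is one transaction
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk"),
  c("milk", "bread")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Metrics for the rule {milk} -> {bread}
supp <- support(c("milk", "bread"))  # share of baskets with both items
conf <- supp / support("milk")        # P(bread | milk)
lift <- conf / support("bread")       # confidence relative to bread's base rate

print(c(support = supp, confidence = conf, lift = lift))
```

A lift below 1, as in this toy case, means the two items co-occur slightly less often than independence would predict.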
3. Social Media People
Beauty, Cooking and Fashion … all 3 categories are correlated with each other (0.63 - 0.72). While these people might not be focused on a healthy lifestyle in terms of exercise and eating right, they are focused on how they look, and an association with social media fits this mold.
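Pairwise correlations like these can be read off a correlation matrix. The sketch below uses simulated data in which a shared latent interest drives three columns; the numbers it prints are therefore illustrative and are not the 0.63 - 0.72 values reported above.

```r
set.seed(3)
n <- 500
# Made-up data: one latent interest drives three categories,
# while politics is generated independently
latent <- rnorm(n)
tf <- data.frame(
  beauty   = latent + rnorm(n, sd = 0.7),
  cooking  = latent + rnorm(n, sd = 0.7),
  fashion  = latent + rnorm(n, sd = 0.7),
  politics = rnorm(n)
)

# Pairwise Pearson correlations between categories
cm <- cor(tf)
print(round(cm, 2))
```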